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Abstract 

Background: The human malaria parasite Plosmodium folciporum survives pressures from the host immune system 
and antimalarial drugs by modifying its genome. Genetic recombination and nucleotide substitution are the two 
major mechanisms that the parasite employs to generate genome diversity. A better understanding of these 
mechanisms may provide important information for studying parasite evolution, immune evasion and drug 
resistance. 

Results: Here, we used a high-density tiling array to estimate the genetic recombination rate among 32 progeny 
of a P. folciporum genetic cross (7G8 x GB4). We detected 638 recombination events and constructed a high- 
resolution genetic map. Comparing genetic and physical maps, we obtained an overall recombination rate of 9.6 
kb per centimorgan and identified 54 candidate recombination hotspots. Similar to centromeres in other 
organisms, the sequences of P. folciporum centromeres are found in chromosome regions largely devoid of 
recombination activity. Motifs enriched in hotspots were also identified, including a 12-bp G/C-rich motif with 3-bp 
periodicity that may interact with a protein containing 1 1 predicted zinc finger arrays. 

Conclusions: These results show that the P. folciporum genome has a high recombination rate, although it also 
follows the overall rule of meiosis in eukaryotes with an average of approximately one crossover per chromosome 
per meiosis. GC-rich repetitive motifs identified in the hotspot sequences may play a role in the high 
recombination rate observed. The lack of recombination activity in centromeric regions is consistent with the 
observations of reduced recombination near the centromeres of other organisms. 
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Background 

The human malaria parasite Plasmodium falciparum 
kills approximately one million people each year, mostly 
children in Africa [1]. The goal of developing an effec- 
tive vaccine to control infection or disease has yet to be 
met. Parasite resistance to multiple antimalarial drugs 
has also spread rapidly in recent years. Genome plasti- 
city and genetic variation are significant challenges to 
vaccine development and contribute to the worldwide 
problem of drug resistance. 

The P. falciparum malaria parasite has a unique and 
complex life cycle involving multiple DNA replications 
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both in the mosquito and in human hosts. Except for a 
brief diploid phase after mating events in the mosquito 
midgut, the parasite stages in both hosts are haploid. 
Human infection commences with the injection of spor- 
ozoite stages by the bite of an infectious mosquito; asex- 
ual sporozoites then travel to the liver where they 
produce tens of thousands of merozoites after multiple 
rounds of DNA replication. The mature merozoites are 
released from the hepatocytes and invade red blood 
cells. Within red blood cells, individual merozoites will 
replicate their DNA 4 to 5 times within 48 hours and 
release 16 to 32 daughter merozoites back into the 
blood stream to infect other red blood cells. This ery- 
throcytic cycle is responsible for the clinical manifesta- 
tions of malaria and can continue until the infection is 
eliminated by the host immune response or cleared by 
antimalarial drug treatment. While the erythrocytic 
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cycle produces millions of haploid asexual parasites, a 
small proportion of the parasites differentiates into male 
and female sexual stages - termed gametocytes - that 
circulate in the bloodstream. When the gametocytes are 
taken up by a feeding mosquito during a blood meal, 
they develop into male and female gametes, mate, and 
form a diploid zygote that develops into an ookinete; 
genetic recombination and meiosis occur at this time 
[2]. The motile ookinete subsequently develops into an 
oocyst containing thousands of sporozoites after rounds 
of mitotic divisions. Completion of the life cycle there- 
fore offers many opportunities for genetic recombination 
and mutation events during numerous rounds of DNA 
replication. 

Genetic recombination can generate novel beneficial 
alleles, or combinations of alleles, that can spread 
through the population driven by positive selection 
[3,4]. In P. falciparum, recombination rates (RRs) vary 
not only among parasite populations but also along 
parasite chromosomes, which exhibit regions of elevated 
or reduced recombination [5,6]. Many factors can influ- 
ence estimates of RR (or more precisely outcrossing 
rate), including the intensity of transmission by mosqui- 
toes, diversity of local parasite populations, the number 
of genetic markers used in the analysis, and chromoso- 
mal locations of specific DNA sequences [7]. Some of 
these factors may help explain the different estimates of 
recombination rates obtained from two genetic crosses 
[8,9]. 

To better understand the mechanism of genetic 
recombination that underlies P. falciparum evolution 
and its response to host immunity and drug pressure, 
we have used a high-density tiling microarray to investi- 
gate the genotypes of progeny obtained from a P. falci- 
parum cross (7G8 x GB4) [9]. Here we show that the 
P. falciparum parasite has a relatively high RR and iden- 
tify putative recombination hotspots with conserved 
motifs that may mediate frequent recombination in the 
parasite. The high RR may provide the genetic basis for 
the parasite to rapidly adapt to a hostile environment 
and to evade host immunity and drug action. 

Results 

Single feature polymorphism detection and genotype 
verification 

Applying the single-feature polymorphism (SFP) calling 
parameters described previously [10], we identified 5,672 
putative SFPs that differed between the GB4 and 3D7 
parasites, 11,892 putative SFPs that differed between 
7G8 and 3D7, and 9,030 putative SFPs (approximately 
0.5% of total unique probes) that were the same 
between GB4 and 7G8 but differed from 3D7. After 
accounting for the redundancy of probes within 25 bp 
that detect the same polymorphisms, and excluding 



single probe calls as well as ambiguous calls from subte- 
lomeric repetitive sequences and multigene families, we 
obtained 4,335 multiprobe SFPs (mSFPs) distinguishing 
the 7G8 from GB4 parasites. Potential errors in geno- 
type calls were corrected using the procedures described 
in Materials and methods, leading to 3,184 high-quality 
mSFPs (Table 1). Interestingly, there were approximately 
twice as many unique probes distinguishing the 3D7 
parasite from 7G8 than from GB4, suggesting a larger 
genetic distance between 7G8 and 3D7 than between 
GB4 and 3D7 (Table 1). In addition to mSFPs, we also 
detected copy number variations between the GB4 and 
7G8 parasites. We detected and mapped 295 differential 
segments that were at least 500 bp long (0.5 to 21.015 
kb) to 340 genes/regions, although the majority (>90%) 
of the signals for the copy number variations were from 
regions containing highly polymorphic antigen gene 
families (Additional file 1). 

Although we applied strict standards in calling mSFPs, 
there were still regions with double crossovers within 
relatively small segments that were likely due to geno- 
type calling errors or possible gene conversions (Addi- 
tional file 2); some of the errors became apparent only 
after multiple consecutive mSFPs were examined simul- 
taneously. To ensure correct genotype calls, we imple- 
mented computational correction protocols (see 
Materials and methods) and compared the inherited 
mSFP genotypes with 8,097 genotypes from 254 micro- 
satellite (MS) markers (32 progeny x 254 MS markers = 
8,128 minus 31 missing data points) [9]. Results identi- 
fied only 31 mismatches, defined as one or two adjacent 
MS markers flanked by mSFP genotypes of different 
alleles, between the MS and mSFP genotypes (Addi- 
tional file 3). These mismatches were from 19 MSs and 
were mostly single MS genotypes flanked by multiple 
mSFP genotypes of different alleles, suggesting potential 
errors from MS typing or spontaneous changes in the 
MS repeats. The high percentage of genotype match 
between MS and SFP genotypes (8,066/8,097 or 99.6%) 
provided good confidence on the data supporting the 
final SFP genotype calls. Although the MSs provided 
relatively good coverage across the genome, there were 
large segments on chromosomes 1, 2, 3, 7, 8, 9, 10, and 
11 that did not have MS coverage (Additional files 2 
and 3). Our mSFPs therefore greatly improve the cover- 
age of genetic markers across the 14 chromosomes. 

To further verify the genotype calls and clarify the 
mismatches, we re-typed the 19 MSs that produced 31 
mismatches between MSs and mSFPs. The typing 
results corrected 26 MS genotyping errors (Additional 
file 3), bringing the correct genotype match rate to 
99.9%. We also randomly selected 14 regions of approxi- 
mately 100 kb or less with two to four mSFPs that pre- 
dicted putative double crossovers in 21 progeny and 
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Table 1 Microarray probes and genotype calls from GB4 and 7G8 comparing with those of 3D7 
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Chr., chromosome; Total, total numbers of probes on each chromosome; 0_0, numbers of probes calling the same alleles as those of 3D7; 1_1, numbers of 
probes calling alleles different from 3D7 in both GB4 and 7G8; 0_1, numbers of probes calling GB4 unique alleles; 1_0, numbers of probes calling 7G8 unique 
alleles; Diff SFP, numbers of differential probes from both1_0 and 0_1; Diff mSFP, multiple-probe SFPs collapsed within 25 bp and excluded calls from 
subtelomeric regions and those in multigene families; % Diff, percentage of Diff SFP probes divided by the total number of probes; 7G8/GB4, the ratio of unique 
7G8 probes over those of GB4; Corrected mSFP, numbers of mSFPs after removing false calls (see Materials and methods for details); Number of MS, number of 
microsatellites used in this study. 



single crossovers in five progeny. We designed 35 PCR 
primer pairs to detect MS polymorphisms informative 
for these single and double crossover segments (Addi- 
tional file 3). Of the 34 primer pairs, 27 were poly- 
morphic between the parents. For the five single 
crossover events, the crossovers were all verified to be 
correct after typing the progeny; however, only one of 
the putative double crossovers in the 21 progeny could 
be verified, suggesting that the majority of the putative 
double crossovers predicted by two to four mSFP mar- 
kers and not removed by our filtering processes were 
false. The two markers flanking the only correctly iden- 
tified double crossover DNA segment spanned 81 kb 
with six SFP markers inside the crossover segment. For 
the remaining 20 double crossovers, 18 had flanking 
markers spanning less than 60 kb, except two that had 
flanking markers spanning 85 and 108 kb, respectively 
(Additional file 3). A search of the entire genome identi- 
fied 176 putative double crossover segments within 
<60 kb (Additional files 3 and 4). Based on this informa- 
tion, we corrected the genotypes of the 176 putative 
double crossover segments flagged by flanking markers 
spanning <60 kb and containing fewer than five mSFPs 
in the segment. 

Crossover counts and bias inheritance 

From the filtered inherited genotypes of the progeny 
(after correction for spurious double crossovers), we 
identified 638 crossovers (Table 2). Plots of genotype 
inheritance patterns for each of the 14 chromosomes 
showed relatively even (1:1) genotype inheritance from 



each parent, except for a few chromosomal regions that 
favored genotypes from one parent (Figure 1). As pre- 
viously noted, one example of inheritance bias was the 
dominant inheritance of the 7G8 haplotype among the 
progeny at one end of chromosome 7 [9] (Figure 1). 
Other regions that had a bias toward the 7G8 genotypes 
were found in parts of chromosomes 6, 8, and 13; and 
regions that favored GB4 genotypes could be found on 
chromosomes 3, 5, 7, and 11. Inheritance bias was also 
found in subtelomeric regions on chromosomes 2, 3, 4, 
5, 6, 8, 9, and 11. Except for the bias at the ends of 
chromosome 3, 6, and 7, the bias inheritances did not 
significantly deviate from the 1:1 ratio (Bonferroni cor- 
rected P < 0.01; Figure 1). Plots of crossover events 
across the physical chromosomes revealed regions with 
clusters of recombination events, many of which located 
at subtelomeric regions, suggesting potential recombina- 
tion hotspots (Figure 1). Interestingly, the putative cen- 
tromeres [11] were mostly located in chromosome 
regions without crossovers or recombination coldspots 
(Figure 1; Additional file 3), consistent with the observa- 
tions for the centromeres of plants, yeast and other 
organisms [12-14]. Significant bias inheritance was pre- 
viously found in the Dd2 x HB3 cross [8]; and the cen- 
tromeres are mostly located in regions with little or no 
recombination activity in this cross too (Additional files 
5 and 6). Similar to the previous observation of a strong 
positive correlation between crossover counts and physi- 
cal chromosome length among the progeny of the Dd2 
x HB3 cross [8], there was also a positive correlation 
between crossover counts and physical chromosome 
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Table 2 Estimates of genetic distance and recombination frequency for each chromosome 
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23,264.3 


21,764.4 


3,438 
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2,514.0/1,654.7 
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Chr., chromosome; Length (kb), chromosome length in kilobases; Marker span (kb), the distance between the first and last markers for each chromosome; 
Number of markers, total numbers of markers, including microsatellites; Number of crossovers, number of crossovers including subtelomeric regions (values 
marked with an asterisk are excluding subtelomeric regions); Gene distance (cM), genetic distance in centimorgans; kb/cM, kilobase per centimorgan obtained 
from the ratio between genetic distance and marker span. 



length in the progeny of the 7G8 x GB4 cross (correla- 
tion coefficient r = 0.85 after removing subtelomeric 
regions; Additional file 7). However, the crossover 
counts per megabase in a meiosis appear to be lower in 
the larger chromosomes (Additional file 8). 

Construction of a high-resolution linkage map 

We next constructed high-resolution genetic linkage 
maps of the 14 chromosomes with 3,184 mSFPs and 254 
MSs using the Haldane map function and compared the 
linkage maps with physical positions of the markers on 
the chromosomes. Genetic distances between the mar- 
kers and map units were calculated, and the physical 
positions of the markers on the chromosomes were iden- 
tified (Figure 2; Additional file 3). Because only 21.8 Mb 
of the 23 Mb P. falciparum genome was covered by our 
mSFPs, a total genetic distance of approximately 2,514.0 
cM produced an average map unit of approximately 9.6 
kb/cM (or 0.10 cM/kb), which is approximately 1.5 to 3.5 
times smaller than the previous estimates [8,9]. We also 
estimated the genetic distances using the Kosambi map 
function and compared the results with those from the 
Haldane map function. The resulting maps (2,490 cM 
Kosambi; 2,514 cM Haldane) were nearly identical, sug- 
gesting little or no recombination interference at gen- 
ome-wide scale. Plots of coefficient coincidence (Z) [15] 
suggested weak crossover interference in some chromo- 
somes. However, the power to detect meiotic crossover 
interference was limited due to the small number of 
meiosis/progeny (Additional file 9). 

Notably, chromosomes 2, 3, 4, 8, and 9 showed rela- 
tively high average RR, whereas the three largest 



chromosomes (12 to 14) showed lower recombination 
rates (Table 2 and Figure 2); however, the high RR in 
the five chromosomes included activity of potential 
recombination hotspots at chromosome ends (Figure 2). 
Removing the hotspots at the ends of these chromo- 
somes greatly reduced the estimates of genetic distances 
for the chromosomes and increased the map unit to 
12.8 kb/cM (Table 2; Additional file 10), which is 
slightly less than the previous estimate of 15 kb/cM 
from the Dd2 x HB3 cross [8]. 

Detection of recombination hotspots 

We applied an overlapping sliding window of 5 kb to 
scan through the markers on each chromosome for 
potential hotspots among the 32 progeny and used Hal- 
dane map function to estimate RR and confidence inter- 
vals. We compared each sliding window's RR with the 
averaged RR estimates across all 14 chromosomes (0.13 
cM/kb). A region was considered a recombination hot- 
spot if there were two or more recombination events 
across the 32 progeny and the estimated RRs were more 
than five-fold higher than the average genome-wide RR. 
These analyses identified 54 segments containing puta- 
tive recombination hotspots in sequences of 298 bp to 
100.4 kb long, including 11 hotspots from subtelomere 
sequences (Figure 3; Additional file 11). All hotspots 
contained low-complexity regions with tandem repeats 
(except for one) according to the classification at Plas- 
moDB [16] and most of them (except for two) were 
within protein coding regions (Additional file 11). Some 
of the hotspot sequences contained genes encoding sur- 
face antigens, protein kinases, and AP2 transcription 
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Figure 1 Recombination events and 7G8 allele frequency along each of the 14 P. falciparum chromosomes. Each panel represents one 
chromosome as marked (chr). Recombination events (black vertical lines) are the number of changes in inheritance pattern (parental allelic type) 
between two adjacent markers among 32 progeny, and 7G8 allele frequency is the proportion of 7G8 alleles among the 7G8 x GB4 progeny 
(red curves). The arrowheads under each panel indicate the positions of putative centromeres for the 14 chromosomes according to [1 1]. The 
dashed horizontal lines delimit the significant inheritance bias from 1:1 segregation. 



factor (Additional file 11). Indeed, genes with Gene 
Ontology terms of antigenic variation, defense response, 
and cell-to-cell interactions were significantly enriched 
(P < 0.01) in the hotspot sequences (Additional file 12). 
Approximately 20% of the hotspots were located at 
chromosome ends or subtelomeric regions defined pre- 
viously [17] (Figure 3), consistent with those inferred 
from SNP studies on field isolates [5]. 

For comparison, we also mapped the 720 MS markers 
(excluding those that could not be mapped due to the 
absence of primer sequences in the current 3D7 genome 
sequence or had positions conflicting with the physical 
genome positions) typed on 35 progeny of the Dd2 x 
HB3 cross to the completed 3D7 chromosomes [8] and 
applied the same criteria to estimate RR and to detect 



recombination hotspots (Additional file 6). We obtained 
an estimate of RR of 12.1 kb/cM if we arranged all the 
MSs according to their positions on physical chromo- 
somes and identified 17 hotspots (Figure 3; Additional 
file 11). All of the hotspots but one are nonsubtelomeric 
because MS markers generally do not cover subtelo- 
meric regions. Only one hotspot region on chromosome 
11 (1,707,326 to 1,743,250 bp) from the Dd2 x HB3 
cross overlapped with those from the 7G8 x GB4 cross 
(1,707,250 to 1,717,037 bp). 

DNA sequences coding for protein low-complexity 
regions (pLCRs) have also been associated with elevated 
recombination [18,19]. These high-GC content minisa- 
tellite pLCRs are found throughout the P. falciparum 
genome [19]. We examined the nonsubtelomeric 
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Chromosome 

Figure 2 Physical and genetic maps of the 14 P. falciparum chromosomes in the 7G8 x GB4 cross. The vertical red scale lines on the left 
indicate genetic distance in centimorgans, and the blue vertical lines on the right are the physical distance in kilobases. Thin grey lines connect 
the genetic position of each marker (3,184 mSFP and 254 microsatellite markers) with their mapped physical positions on the chromosome. 
Please see Additional file 3 for detailed information. Note the elevated recombination activities at chromosome ends, particularly those at ends 
of chromosomes 2, 3, 4, 8, and 9, which increase the estimate of genome-wide recombination rate. Maps after removing subtelomeric markers 
are shown in Additional file 10. The arrowheads on the right side of the blue vertical lines indicate the positions of putative centromeres for the 
14 chromosomes according to [1 1]. 



hotspots for recombinogenic pLCRs. We found 427 
regions; however, only one hotspot contained one of 
these high-GC pLCR regions (found on chromosome 9, 
in gene PFI0685w, annotated as a putative pseudouridy- 
late synthase). 

Motifs enriched in recombination hotspots 

Repetitive DNA sequences such as minisatellites have 
been associated with recombination hotspots and gen- 
ome instability in humans [7]. To search for motifs 
associated with P. falciparum recombination, we ana- 
lyzed DNA sequences within the hotspot sequences 
using the MEME motif discovery toolkit [20]. We 
searched for enriched motifs in nonsubtelomeirc and 
subtelomeric hotspot sequences smaller than 15 kb from 
the two crosses and identified three GC-rich motifs that 
were enriched in the hotspot sequences (Figure 4; 



Additional file 13), including one 21-bp motif (TA[TA] 
GTTAGT[CG]AAG[TG]TAAGACC) (Figure 4a) from 
subtelomeric hotspots that is similar to the Rep20 
sequence implicated in recombination activity of chro- 
mosome subtelomeric regions [21-23]. Another enriched 
motif from subtelomeric hotspots was a 12-bp GC-rich 
sequence containing GCA[TC] [CA] [TG] AG[GT]TGC 
(Figure 4b). A 12-bp G-rich (or C-rich on reverse 
strand) motif ([TG]GA[TA]GAAG[AG] [TG]GA) was 
also identified from the nonsubtelomeric hotspot 
sequences (Figure 4c). The Rep20 related motif was pre- 
sent in the majority (64%) of the subtelomeric hotspots 
(almost none in nonsubtelomeric hotspots) and was sig- 
nificantly enriched (P = 0.014) compared to matched 
coldspot sequences (Additional file 13). The 12-bp sub- 
telomeric motif is present in approximately 80% of hot- 
spots and trends to higher frequency in hotspots relative 
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Figure 3 Plot of recombination rate of the 7G8 x GB4 cross 
showing recombination hotspots along each of the 14 P. 
falciparum chromosomes. A region was considered a 
recombination hotspot (asterisks) if there were two or more 
recombination events across the 32 progeny and the estimated 
recombination rates were higher than the chromosome-wide 
average rate by five-fold or more. The 14 chromosomes are as 
marked and separated with the vertical dashed lines. The black 
asterisks are hotspots from the 7G8 x GB4 cross, and the red 
asterisks indicate hotspot positions from the Dd2 x HB3 cross. The 
arrowheads under each panel indicate the positions of putative 
centromeres for the 14 chromosomes according to [1 1]. 



to coldspots (P = 0.055) and to the genome average (P = 
0.077) (Additional file 13). The 12-bp G/C-rich nonsub- 
telomeric motif contained a 3-bp repeat that might be 
the binding site of DNA binding proteins and was pre- 
sent in approximately 30% of nonsubtelomeric hotspots, 
a frequency significantly higher than those in matched 
coldspot sequences (P = 0.002) and in the genome aver- 
age (P = 6.0E-6). 

Since only 32 independent recombinant progeny were 
available for this study, a single crossover may represent a 
region with elevated recombination activity. We therefore 
searched all the crossover sites defined by marker intervals 
less than 5 kb, including 10 sequences from subtelomeric 
regions and 103 sequences from nonsubtelomeric regions. 
A 12-bp G-rich motif was detected in three of the ten sub- 
telomeric sequences (Figure 4d); and a 12-bp motif with 3- 
bp G periodicity detected in the nonsubtelomeric regions 
was essentially the same as the one observed in the non- 
subtelomeric hotspots (Figure 4e). Both motifs were pre- 
sent at significantly higher frequency (P < 0.05) than that 
of the genome average, although the 12-bp nonsubtelo- 
meric motif did not have significantly higher frequency 
than those in coldspot controls. 

Sequences with AT repeats or A/T tracks were found 
in almost all the hotspot sequences (data not shown). A 



(a) 



S 1 



i n m 





S 1 
m 



1 

m 



kTiATITsA 



Figure 4 Motifs enriched in recombination hotspot sequences. 

Motifs were identified from hotspot sequences using onr (any 
number of repetitions) in MEME. (a) A 21 -bp motif from 
subtelomeric hotspot sequences of the 7G8 x GB4 cross that is 
similar to that of the Rep20 repeat reported previously [21-23]. (b) A 
GC-rich 12-bp repeat from subtelomeric hotspot sequences of the 
7G8 x GB4 cross, (c) A 12-bp G/C-rich motif from the non- 
subtelomeric combined hotspot sequences of both crosses. (d,e) 
Two 12-bp G/C-rich motifs from subtelomeric and nonsubtelomeric 
sequences with at least one crossover within 5-kb interval, 
respectively, from the 7G8 x GB4 cross. 



search of DNA sequences in the hotspots using the oops 
(one-occurrence-per-sequence) function in the MEME 
program for motifs that occur once in each hotspot 
sequence identified polyA, polyT, and (TA) n repeats 
(data not shown); however, the frequencies of these AT 
repeats or A/T tracks in the hotspot sequences were not 
significantly different from those in the genome or 
matched coldspot sequences (Additional file 13). 

Discussion 

We used a high-density tiling array and the parents and 
progeny from a genetic cross to investigate genetic RR 
and recombination hotspots in the P. falciparum malaria 
parasite. Our results show that P. falciparum has a 
higher RR than previously reported [8,9]. In a recent 
study, the RR of the 7G8 x GB4 cross was estimated to 
be approximately 36 kb/cM using genotypes from a lim- 
ited set of 285 MS markers [9]; in another study, the RR 
of the Dd2 x HB3 cross (35 progeny) was estimated to 
be 17 kb/cM (14.8 kb/cM if using the corrected 23 Mb 
genome size) [8]. A similar estimate (13.7 kb/cM) was 
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obtained from 28 independent progeny of a rodent 
malaria parasite {Plasmodium c. chabaudi) cross that 
were typed with 614 amplified fragment-length poly- 
morphisms [24]. Our higher RR estimate is largely due 
to the inclusion of the highly recombinogenic subtelo- 
meric sequences. If we remove the crossover counts 
from the subtelomeric regions, the estimated RR in the 
7G8 x GB4 cross is 12.8 kb/cM (Table 2). This estimate 
is essentially the same as the one estimated from the 
Dd2 x HB3 cross (12.1 kb/cM) using the same methods 
employed in this study. The estimated RR of P. falci- 
parum is comparable to that of Cryptosporidium par- 
vum (10 to 56 kb/cM) [25], but is much higher than the 
estimated RR of Toxoplasma gondii (104 kb/cM) [26], 
rat (1.8 Mb/cM), mouse (1.9 Mb/cM), or human (0.8 
Mb/cM) [27]. 

After data processing and experimental verification of 
genotypes, our SFP genotypes matched well (99.95%) 
with those from 254 MSs. Comparison of our SFP geno- 
types with 8,097 MS genotypes showed that the number 
of mismatched genotypes between the two data sets was 
small (four mismatches or 0.05%). In theory, these four 
mismatches in genotypes between MS and mSFPs could 
be due to genotype calling errors from either the tiling 
array or MS typing. The mismatches could also be true 
differences in genotype as the mSFPs and MSs were 
located at slightly different positions on the chromo- 
somes. We recognize that our strict genotype calling 
processes may have excluded some gene conversion and 
ectopic recombination events, which are common 
between the paralogous loci of gene families [28]. High 
RR and recombination hotspots on chromosome 3 have 
also been observed in field populations, and no detect- 
able linkage disequilibrium was detected between mar- 
kers less than 1 kb apart in some African populations 
[5,29,30]. 

Various DNA sequences have been found to influence 
genetic recombination or to be associated with hotspots, 
including GC-rich DNA [31,32], repetitive minisatellites 
or MSs [33-36], and transcription factor binding sites 
[34]. In particular, a 13-mer C-rich degenerate motif 
(CCNCCNTNNCCNC) with a 3-bp periodicity sugges- 
tive of an interaction with zinc-finger DNA-binding pro- 
teins has been found to mediate recombination in 
human [32]. Additionally, imprinted chromosome 
regions generally have higher than average recombina- 
tion rates [37], and the relative activity of hotspots is 
also regulated by various factors that can directly or 
indirectly interact with these sequences [38]. In human 
and mouse, a protein (PRDM9) with a Kruppel asso- 
ciated box (KRAB), a histone methyl transferase domain 
(SET) and multiple zinc fingers was found to bind the 
C-rich 13-bp motif in hotspots and target the histone 
methylation activity to specific sites in the genome 



[39-41], The hotspot sequences we identified are also 
relatively GC-rich, cover coding regions, and carry repe- 
titive sequences (Additional file 11). We searched for 
motifs that might be associated with recombination hot- 
spots in P. falciparum. Several relatively GC-rich motifs 
were identified, including a 21 -bp motif that is similar 
to the Rep20 repeat that has been implicated in genetic 
recombination. The Rep20 family is among a number of 
gene families in subtelomeric regions that may have a 
role in antigenic variation [21-23,28]. As expected, the 
21 -bp motif was mostly from repetitive regions of subte- 
lomeric hotspots. Although the 13-bp GC-rich motif 
identified in the human genome by Myers et al. [7] was 
not found in our hotspots, we detected a 12-bp motif 
that is relatively G-rich from the P. falciparum genome 
(C-rich from the opposite strand). Significantly, the 12- 
bp nonsubtelomeric G/C-rich motif and the Rep20 
motif share a common feature with a 3- to 4-bp G peri- 
odicity that suggests a potential for interaction with 
zinc-finger DNA-binding proteins, similar to those of 
the 13-bp motif seen in the human genome [39-41]. A 
keyword search of the P. falciparum genome database 
[16] using 'zinc finger' found more than 200 zinc finger 
proteins in the P. falciparum genome, and a Blast search 
of the database using human PRDM9 identified a pro- 
tein (PFL0465c) with 11 predicted zinc fingers (Addi- 
tional file 14). PFL0465c has some conserved amino 
acids at the putative regions homologous to the KRAB 
and SET domains of PRDM9, but whether these regions 
have the expected activities remains unknown because 
the levels of homology are low. Interestingly, Genome- 
Net motif search [42] also identified a putative eukaryo- 
tic DNA topoisomerase I DNA binding domain in 
PFL0465c (Additional file 14). Prediction of DNA bind- 
ing of the protein using an online tool [43,44] showed 
significant P- values (P = 0.01 to 0.04, using polynomial 
kernels and 40% A, 40% T, 10% G and 10% C) for bind- 
ing to the motifs in Figure 4, although the SVM (sup- 
port vector machine) scores were all negative. Because 
the low predicted specificity of some zinc fingers and 
multiple combinations may contribute to DNA recogni- 
tion [39], whether the zinc fingers in PFL0465c can bind 
the DNA motifs we identified requires further investiga- 
tion. Since the non-coding regions of the P. falciparum 
genome are very AT-rich, it is not surprising to see that 
all the hotspot sequences, which are usually GC-rich, 
are found in the GC-rich coding regions. 

Similarly, AT-rich repeats were found in almost all the 
hotspot sequences. Monomeric A/T tracks have been 
associated with break points on chromosome 5 of P. fal- 
ciparum [45], and many MSs, particularly poly-purine/ 
poly-pyrimidine, have been associated with recombina- 
tion hotspots in the Saccharomyces cerevisiae genome 
[35]. However, the frequencies of the AT-rich repeats in 
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our hotspots were not significantly higher than the gen- 
ome average; the presence of these AT-rich motifs in 
hotspots could be simply due to the abundance of the 
AT-rich repeats in the parasite genome. The functional 
roles of these motifs in genetic variation require further 
investigation. 

It is interesting that the three largest chromosomes 
have low RRs. Higher RRs for smaller chromosomes - 
termed chromosome size-dependent control of meiotic 
reciprocal recombination - has been reported in 
humans, Saccharomyces cerevisiae, and other organisms 
[27,46]. This chromosome size-dependent recombina- 
tion was thought to be important for ensuring homolo- 
gous chromosome crossover during meiosis and to be 
caused by different amounts of crossover interference 
between the chromosomes [47]; however, a recent study 
suggested that differences in RR in budding yeast were a 
function of their DNA sequence, and not due to the size 
of the chromosome [48]. Although our smaller number 
of progeny has low power to detect crossover interfer- 
ence, evidence of interference, particularly in the large 
chromosomes, was detected. The observation of rela- 
tively high RR in some smaller chromosomes also 
appeared to be largely due to recombination hotspots at 
the chromosome ends. Higher RR in smaller chromo- 
somes was also observed in parasites collected from new 
Cambodian patients [6] and in the human genome [49]. 

Centromeres are characterized by high AT content 
and with little or no genetic recombination [12,13]. Eto- 
poside-mediated topoisomerase-II cleavage was recently 
employed to identify centromere locations in P. falci- 
parum [11]. Comparison of these locations with maps of 
the chromosome crossover sites shows that all of the 
centromeres are located in regions with little or no 
recombination activity (Figure 1; Additional file 3). The 
results are consistent with the observations of reduced 
recombination at centromere regions in other organ- 
isms, supporting the identity and locations of the P. fal- 
ciparum centromeres. Some crossovers were found at 
the centromeres in the Dd2 x HB3 cross (Additional file 
6), which can be partly explained by the lower density 
of genetic markers in this cross. 

Biased inheritance patterns were observed on some 
chromosomes, in particular, chromosomes 7, 8, 11, and 
13 (Figure 1). Most of the progeny inherited the 7G8 
allele at one end of chromosome 7. This observation 
suggests that inheritance of the 7G8 alleles in this 
region may provide a competitive advantage during pro- 
pagation in either the mosquito, chimpanzee (the pri- 
mate host used to passage the recombinant progeny 
through the liver cycle), or in tissue culture. Biased 
inheritance has also been observed in the Dd2 x HB3 
cross [8], but the reasons for the inheritance bias are 
still unknown. 



We did not find evidence that pLCR-mediated recom- 
bination is a driver of hotspot structure in the genetic 
cross. These pLCR recombinogenic regions typically are 
high-GC content minisatellite repeats found in protein- 
coding regions. Although these regions are recombino- 
genic when they occur in proteins [19], they are not sig- 
nificantly enriched in the hotspots found in our genetic 
cross, suggesting that recombination mediated by these 
regions is not a major driver in the mechanism of 
recombination. 

Conclusions 

We have constructed a high-resolution linkage map for 
a P. falciparum cross with 3,184 mSFPs and 254 MSs, 
providing a density of one genetic marker every approxi- 
mately 6.3 kb or every 0.7 cM, greatly improving the 
power to fine-map loci in genetic mapping studies. This 
study also represents the first investigation of recombi- 
nation hotspots using progeny from genetic crosses and 
the identification of motifs potentially associated with 
high recombination rate in malaria parasites. Interest- 
ingly, the 12 -bp motif identified in our study has a 3 -bp 
periodicity also found in the motif mediating recombi- 
nation in the human genome. Lack of recombination 
activity at the putative centromere sites is consistent 
with the characteristics of centromeres in other organ- 
isms. The high-resolution genetic map, the estimates of 
RR, and the conserved motifs detected in the hotspot 
sequences will greatly facilitate investigation of mechan- 
isms of genetic recombination and the role of genetic 
recombination in parasite diversity and survival. 

Materials and methods 

Parasites and parasite culture 

Thirty-two P. falciparum independent recombinant pro- 
geny from the 7G8 x GB4 cross and the two parental 
lines have been previously described [9]. Parasites were 
maintained in RPMI 1640 medium containing 5% human 
0 + erythrocytes (5% hematocrit), 0.5% Albumax (GIBCO, 
Life Technologies, Grand Island, NY, USA), 24 mM 
sodium bicarbonate, and 10 (ig/ml gentamicin at 37°C 
under an atmosphere of 5% C0 2 , 5% 0 2 , and 90% N 2 . 

Microarray Genechip®, DNA hybridization, and data 
normalization 

The PFSANGER Genechip® was purchased from Affy- 
metrix, Inc. (Santa Clara, CA, USA), and array hybridi- 
zation was performed at the microarray facility of the 
National Cancer Institute (Frederick, MD, USA). The 
probes on the array were designed based on P. falci- 
parum genome (3D7) sequence v2.1.1 covering genomic 
regions where unique probes with a reasonably broad 
thermal range could be designed. Because of recent 
updates of genome databases, all probe sequences were 
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reassigned to new coordinates along each chromosome 
according to the 3D7 genome sequence in PlasmoDB 
v6.0. DNA extraction, labeling, microarray hybridization, 
data collection, and normalization have been described 
[10]. Hybridized chips were washed and stained follow- 
ing the EukGE-WS2v5 protocol from Affymetrix and 
scanned at 570 nm emission wavelength using Affyme- 
trix scanner 3000. The scanned image CEL files were 
processed using the R/Bioconductor package and the 
robust multichip analysis method [50]. The programs 
retrieved individual probe hybridization signal, sub- 
tracted the background noise, quantile-normalized sig- 
nals across all chips, and log 2 transformed the data into 
a final data matrix. The raw and normalized data 
obtained for this publication have been deposited in 
NCBFs Gene Expression Omnibus and are accessible 
through GEO Series accession number [GEO:GSE25656] 
[51]. 

Single feature polymorphism and parental genotype 
assignment 

SFP calls were recorded and validated using an in-house 
perl script as described and validated previously [10]. 
An SFP was defined as reduction in signal intensity 
three-fold or greater than that from the reference 3D7 
genome regardless of the numbers or types of substitu- 
tions covered by a probe. A probe was assigned to be an 
SFP (T) if the signal reduction was at least three-fold 
(conservative to reduce false positive) that of 3D7 and 
no SFP ('0') was called if the signal fold change was less 
than 3.0. For each progeny, there were generally four 
different possible genotypes: both 7G8 and GB4 are the 
same as 3D7, designated as '0_0'; both 7G8 and GB4 are 
the same but different from 3D7 ( f l_V); 7G8 is different 
from 3D7 and GB4 is not ('1_0'); and GB4 is different 
from 3D7, and 7G8 is not ( , 0_1'). From these SFP calls, 
we selected probes that have differential SFP calls 
between the two parents (that is, one parent was T and 
the other one was '0'), then signals from the probes of 
each progeny were individually assigned based on com- 
parisons to the signals from the two parents. Because 
single-probe calls were shown to be error prone [10], an 
SFP was called only if at least two continuous probes 
indicated a polymorphism. To avoid calls from overlap- 
ping redundant probes, we collapsed all probes over- 
lapped within 25 bp into one SFP (mSFP) [10]. 

Assignment of parental genotype calls 

A quick scan of the genotype inheritance revealed many 
double crossovers within small DNA segments that were 
likely genotype calling errors in the progeny, particularly 
when the same double crossovers occurred in multiple 
progeny (vertical lines in Additional file 2). Assuming a 
recombination rate of 1% (1 cM) per 10 kb 



(approximately the average spacing of the mSFP mar- 
kers) and no genetic interference, the probability of hav- 
ing two crossovers in two consecutive marker intervals 
is about 1%. However, almost 50% of the crossovers are 
adjacent to each other in the uncorrected genotypes, 
which suggested that most of these double-crossovers 
are errors because it is unlikely that multiple progeny 
have the same crossovers within a small segment of the 
chromosome. For instance, the probability for one pro- 
geny to have two crossovers with six markers (approxi- 
mately 50 kb) is about 5%. For another progeny to have 
crossovers at the exact two intervals is 0.01 x 0.01 = 
0.01%. Therefore, the probability of two progeny having 
the same two crossovers at the same two intervals 
within 6 markers is 0.05 x 0.01 x 0.01 = 5 x 10" 6 . To 
reduce excessive variability due to potential genotype 
calling errors, we applied the following steps to filter 
out the double crossovers within short distances (poten- 
tial genotype calling errors). We first combined our 
mSFP calls with the genotypes from 254 MS markers 
ordered by physical positions [9] and imputed 31 miss- 
ing MS genotypes using the nearest mSFP markers. We 
then searched for single mSFP markers of one parental 
genotype that were flanked by two markers of the other 
parental genotype, that is, 1-0-1 or 0-1-0, and removed 
the middle markers in the likelihood that they were 
erroneous. We used an iterative process to identify dou- 
ble crossovers with single mSFP markers across the 
chromosomes, starting with those having the largest 
numbers of progeny with the same switching pattern 
and corrected the single mSFP genotypes. Genotypes 
from MS markers were not corrected. We also corrected 
double crossovers with two alternative genotypes in 
between (0-1-1-0) if there were two or more progeny 
that had the same double crossovers. Although double 
crossovers with two alternative genotypes may occur by 
chance, the likelihood of more than one progeny having 
the same pattern is very low (<5.0E-6). Again, if the MS 
markers also indicated a double crossover, no correc- 
tions were made. 

After the computational cleanups, we designed 35 
pairs of PCR primers to experimentally validate 14 dou- 
ble crossovers in 21 progeny and 5 single crossover 
events (Additional file 3). Based on the results from the 
experimental data, we manually corrected false double 
crossovers by two criteria: potentially erroneous double 
crossover calls reported from DNA segments smaller 
than 60 kb containing central markers with different 
genotypes from flanking markers; and the DNA segment 
has fewer than five mSFPs with genotypes different from 
those flanking the segment. We also re-typed the 31 
MSs that had mismatches with our mSFP genotypes. 
PCR products were separated in a QIAexcel machine, 
and MS genotypes (sizes in base pairs) were scored. The 
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final genotype calls and the inheritance of each marker 
were displayed in Excel spread sheets (Additional file 3). 

Estimating RR and construction of a high-resolution 
genetic map 

We used the Haldane map function to estimate genetic 
distance and calculated RR over a 5-kb window: 

r = I (i _ e -2^ (1) 

where, 0 < r < 0.5, is the recombination fraction (the 
fraction of recombinant offspring showing a crossover 
between two adjacent markers), d is the physical distance 
(in kilobases), and X is the genetic distance (Morgan) per 
unit distance (kb). For more than one marker interval, if 
we assume X is constant, the likelihood function is: 

k 

L=n*f(i-nr^ (2) 

1=1 

where x t is the number of recombinants with regard 
to marker interval i and n is the total number of pro- 
geny; k is the total number of marker intervals; and r t is 
the recombination fraction at marker interval i. Substi- 
tute in Equation 1, we have: 

L = U(l-e- 2Xd f(l + e- 2Xd -y~ Xi (3) 

1=1 

The maximum likelihood estimator of X can be 
numerically derived by maximizing the log-likelihood 
function using the Newton-Raphson algorithm: 

k 

l = \ogL = J2 [ x i lo § (l - e ~ 2Xdl ) + ( n ~ x i) log (l + <T 2U )] (4) 
i=i 

where d b is the physical distance at marker interval L 
To estimate the map distance using Kosambi map 
function, Equation 1 was replaced with: 

r = i tanh(2Ad) (5) 

where tanh is the hyperbolic tangent function and: 

e 2x — 1 

tanh(x) = — (6) 

w e 2x + 1 

Coefficient of coincidence (Z) as a function of inter- 
crossover distance in megabases was estimated using the 
methods described [15]. 

Identification of recombination hotspots 

We used overlapping 5-kb sliding windows to scan 
through the markers on each chromosome for recombi- 
nation hotspots. For each scanning window, we selected 



all the markers if there were two or more markers 
within the window. For example, if the distance between 
marker 1 and 4 was less than 5 kb but the distance 
between 1 and 5 was over 5 kb, we would include mar- 
kers 1 to 4 (that is, three marker intervals) in the first 
estimate of RR. In cases where the next nearest marker 
was more than 5 kb away, we used the consecutive mar- 
ker pair. For each such marker pair or a window, we 
used the methods outlined above to estimate the RR 
and confidence intervals. A set of markers was labeled 
as a candidate recombination hotspot if there were two 
or more recombination events among 32 progeny and 
the estimated RR was at least five times higher than the 
genome-wide average. Because the windows are overlap- 
ping, the selected marker intervals might also be over- 
lapping. In this case, we only selected the marker 
interval that had the highest lower 95% confidence limit, 
which generally implied the most recombination events 
in the shortest marker interval. The same criteria were 
used to select recombination hotspots from 35 indepen- 
dent progeny of the Dd2 x HB3 cross. 

Search of conserved motifs in breakpoints and 
recombination hotspots 

We used MEME Suite, a motif discovery toolkit [20], to 
search for common motifs in recombination hotspot 
sequences. Methods of anr (any number of repetitions) 
and oops (one occurrence per sequence) were used to 
discover motifs that were enriched in hotspot sequences 
using various motif widths, including 50 bp (default) 
and variable widths from 7 bp to 100 bp. Non-AT core 
motifs discovered were counted using FIMO (find indi- 
vidual motif occurrences) using a corresponding score 
matrix with a P-value cutoff of 5.0E-6 and overlapping 
counts removed, and AT-repeats and A/T stretches 
were counted using in-house scripts to match 100% of 
the character. We counted and compared the frequen- 
cies of the motifs in hotspots and the whole genome as 
well as matched coldspot sequences. Coldspot sequences 
were randomly selected sequences outside the hotspots 
to match each hotspot with the same length, similar GC 
contents (±2%), and chromosomal region (variable 
region or not). A Poisson test and generalized estimat- 
ing equations were used to determine whether any dif- 
ferences in motif frequency were significant. The 
hotspot sequences and mapped genes were also analyzed 
for enrichment in Gene Ontology terms according to 
methods described previously [10]. We also searched all 
the crossover sites (breakpoints) with marker intervals 
smaller than 5 kb using the same methods. 

Low-complexity region-mediated recombination 

Low- complexity regions were located in the 43 nonsub- 
telomeric hotspots and identified using methods 
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described by DePristo et al. [18]. A total of 427 regions 
were then extracted and examined for AT content and 
sequence regularity (minisatellite, MS, or heterogeneous 
repeat) as described [19]. 

Additional material 



Additional file 11: Length, recombination rate, chromosome 
location, and AT content of DNA sequences in the recombination 
hotspots of the 7G8 x GB4 and Dd2 x HB3 genetic crosses. 

Additional file 12: Significantly enriched Gene Ontology terms 
(genes) in the recombination hotspots. 

Additional file 13: Putative motifs enriched in hotspots and their 
occurrence frequencies (counts/kb). 

Additional file 14: Aligned amino acid sequences of the putative P. 
falciparum zinc finger protein (PFL0465c) and those of human and 
mouse PRDM9 proteins. Human PRDM9 was used to blast the P. 
falciparum genome database [16], and PFL0465c was the protein with 
the highest score (192) and the lowest P-value (3.0E-15). The color coded 
domains are: yellow, P. falciparum zinc fingers; green, human PRDM9 zinc 
fingers; grey, KRAB box; cyan, SET domain; purple, putative eukaryotic 
DNA topoisomerase I DNA binding domain. The domains/motifs were 
identified using GenomeNet Motif Search [42], 
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bp: base pair; KRAB: Kruppel associated box; MS: microsatellite; mSFP: 
multiprobe SFP; pLCR: protein low-complexity region; RR: recombination 
rate; SFP: single feature polymorphism. 
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